Abstract

In this note we describe a method for recursively estimating the depth of a scene
from a sequence of images. The inputs to the estimator are brightness values at a
number of locations on a grid in a video image, and the output is the relative (scaled)
depth corresponding to each image point. The estimator is invariant with respect to
the motion of the viewer, in the sense that the motion parameters are not part of
the state of the estimator, and therefore the estimates do not depend on motion as
long as there is enough parallax (the translational velocity is nonzero). This scheme
is a “direct” version of another algorithm previously presented by the authors for
estimating depth from point-feature correspondence independent of motion.

Consider a sequence of images, described by a map from a location $x$ on the pixel grid
and a time instant $t$ to a brightness value in $\mathbb{R}_+$:

$$ I : \mathbb{R}^2 \times \mathbb{R}_+ \to \mathbb{R}_+, \qquad (x, t) \mapsto I(x, t). \tag{1} $$

In practice the brightness values are quantized, and we lump the effects of the quantization
errors and other sensor noise into an additive Gaussian noise component, so that we
measure

$$ I(x, t) + n_I(x, t), \qquad n_I \in \mathcal{N}(0, \sigma). \tag{2} $$

As the camera moves relative to the scene, the brightness patches on the image plane move
accordingly. Under somewhat restrictive circumstances, we can assume that the brightness
of each point in the scene remains unchanged. This assumption can be violated in a number
of cases (specularities, reflections, non-uniform lighting, etc.), but is by and large satisfied
in many practical circumstances [4].

The image brightness constancy assumption corresponds to enforcing that the total time
derivative of the image brightness vanishes at each pixel location:

$$ \frac{d}{dt} I_r(x, t) = 0 \qquad \forall\, x \in \mathbb{R}^2;\ t \in \mathbb{R};\ r \in \mathbb{N} \tag{3} $$

where $r$ indicates a particular level of resolution. By expanding the above derivative into its
spatial gradient and its temporal derivative we get

$$ \nabla_x I_r(x, t)\, \dot{x}(t) + \frac{\partial I_r(x, t)}{\partial t} = 0 \tag{4} $$

where $\dot{x}$ is the velocity of the brightness pattern at the image location $x = [x \;\, y]^T$ (optical
flow). The optical flow can be loosely related to the velocity of the projection of any particular
point $X$ in the scene (motion field). In particular, under the assumption that the relative
motion between the viewer and the scene is rigid with translational velocity $V$ and rotational
velocity $\Omega$, the motion field can be written as


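$$ \dot{x}(t) = \left[\, \frac{1}{X_3} A \;\middle|\; B \,\right] \begin{bmatrix} V(t) \\ \Omega(t) \end{bmatrix} \tag{5} $$

where

$$ A = \begin{bmatrix} 1 & 0 & -x \\ 0 & 1 & -y \end{bmatrix}, \qquad B = \begin{bmatrix} -xy & 1 + x^2 & -y \\ -1 - y^2 & xy & x \end{bmatrix}. \tag{6} $$

The above equation can be written more concisely as

$$ \dot{x} = C(x, d) \begin{bmatrix} V \\ \Omega \end{bmatrix} \tag{7} $$

where

$$ d = \frac{1}{X_3} \tag{8} $$

is the inverse depth of the point with coordinates $X$ and

$$ C(x, d) = [\, dA \mid B \,]. \tag{9} $$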
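
As a sanity check on (5)-(9), the following minimal Python sketch builds $A$, $B$ and $C(x, d)$ and compares the predicted image velocity with a finite-difference derivative of the projection. It assumes NumPy, a pinhole projection $x = X_1/X_3$, $y = X_2/X_3$, and scene-point kinematics $\dot{X} = \Omega \times X + V$ (the convention that reproduces the signs of $A$ and $B$ above). The function names and numerical values are illustrative and are not taken from the original note.

```python
import numpy as np

def motion_field_matrices(x, y):
    """Return the 2x3 matrices A and B of equation (6) at image point (x, y)."""
    A = np.array([[1.0, 0.0, -x],
                  [0.0, 1.0, -y]])
    B = np.array([[-x * y, 1.0 + x**2, -y],
                  [-1.0 - y**2, x * y, x]])
    return A, B

def C(x, y, d):
    """C(x, d) = [d A | B] of equation (9); d is the inverse depth 1 / X_3."""
    A, B = motion_field_matrices(x, y)
    return np.hstack([d * A, B])

def project(X):
    """Pinhole projection of a 3-D point (assumed camera model)."""
    return X[:2] / X[2]

# Hypothetical scene point and rigid motion, chosen only for illustration.
X = np.array([0.4, -0.3, 2.5])
V = np.array([0.1, -0.05, 0.2])
Omega = np.array([0.02, -0.01, 0.03])

x, y = project(X)
flow = C(x, y, 1.0 / X[2]) @ np.concatenate([V, Omega])   # equation (7)

# Central finite difference of the projection along X_dot = Omega x X + V.
h = 1e-6
X_dot = np.cross(Omega, X) + V
flow_fd = (project(X + h * X_dot) - project(X - h * X_dot)) / (2 * h)

print("motion field from (7):", flow)
print("finite difference    :", flow_fd)
```

The two printed vectors should agree up to finite-difference error, which confirms that the matrices in (6) are consistent with the assumed kinematics.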


Under the assumption that the scene has Lambertian properties and constant illumination,
we can assume that the optical flow and the motion field coincide, so that we can
substitute (5) into (4) in order to get


$$ \nabla_x I_r(x, t)\, C(x, d) \begin{bmatrix} V \\ \Omega \end{bmatrix} + \frac{\partial I_r(x)}{\partial t} = 0. \tag{10} $$
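
The brightness-constancy constraint (4), on which (10) rests, can be checked numerically on synthetic data. The sketch below is only an illustration under stated assumptions: it uses NumPy, a made-up smooth brightness pattern that translates rigidly with a known constant image velocity `u`, and central finite differences for the spatial and temporal derivatives. None of these choices come from the original note.

```python
import numpy as np

def pattern(x, y):
    """Hypothetical smooth brightness pattern (positive everywhere)."""
    return 1.0 + 0.5 * np.sin(3.0 * x) * np.cos(2.0 * y)

u = np.array([0.3, -0.2])            # assumed constant image velocity
xs = np.linspace(-1.0, 1.0, 201)
ys = np.linspace(-1.0, 1.0, 201)
Xg, Yg = np.meshgrid(xs, ys, indexing="xy")   # rows index y, columns index x
dt = 1e-2

def frame(t):
    """Image at time t: the pattern shifted by u * t, so brightness is conserved."""
    return pattern(Xg - u[0] * t, Yg - u[1] * t)

I_prev, I_now, I_next = frame(-dt), frame(0.0), frame(dt)

h = xs[1] - xs[0]
I_y, I_x = np.gradient(I_now, h, h)      # derivatives along axis 0 (y) and axis 1 (x)
I_t = (I_next - I_prev) / (2.0 * dt)     # central difference in time

# Left-hand side of (4): spatial gradient times velocity plus temporal derivative.
residual = I_x * u[0] + I_y * u[1] + I_t

# Skip the border, where np.gradient falls back to one-sided differences.
max_residual = np.abs(residual[1:-1, 1:-1]).max()
print("largest |residual| in the interior:", max_residual)
print("largest |dI/dt| for comparison    :", np.abs(I_t).max())
```

The residual is small compared with the temporal derivative itself, and it shrinks further as the grid spacing and the time step are reduced.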


Given a number of locations on the image plane, for example on a pixel grid,

$$ x^i \qquad \forall\, i = 1 \ldots n \tag{11} $$
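we can collect the constraints in (10) written at $x^i$, $\forall\, i = 1 \ldots n$, with $n > 6$:

$$ \begin{bmatrix} \nabla_{x^1} I_r(x^1, t)\, C(x^1, d^1) \\ \vdots \\ \nabla_{x^n} I_r(x^n, t)\, C(x^n, d^n) \end{bmatrix} \begin{bmatrix} V \\ \Omega \end{bmatrix} + \begin{bmatrix} \dfrac{\partial I_r(x^1)}{\partial t} \\ \vdots \\ \dfrac{\partial I_r(x^n)}{\partial t} \end{bmatrix} = 0 \tag{12} $$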


and solve for the motion parameters $V$, $\Omega$ as a function of the image derivatives and the
inverse depth $d$ in a least-squares sense:


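$$ \begin{bmatrix} V \\ \Omega \end{bmatrix} = -\,\mathcal{C}^\dagger(x, \nabla_x I_r, d)\, \frac{\partial I_r(x)}{\partial t} \tag{13} $$

where

$$ \mathcal{C}(x, \nabla_x I_r, d) = \begin{bmatrix} \nabla_{x^1} I_r(x^1, t)\, C(x^1, d^1) \\ \vdots \\ \nabla_{x^n} I_r(x^n, t)\, C(x^n, d^n) \end{bmatrix} \tag{14} $$

$$ \frac{\partial I_r(x)}{\partial t} = \begin{bmatrix} \dfrac{\partial I_r(x^1)}{\partial t} \\ \vdots \\ \dfrac{\partial I_r(x^n)}{\partial t} \end{bmatrix} \tag{15} $$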


and $\dagger$ denotes the pseudo-inverse. If we substitute the estimate of the motion parameters $V$
and $\Omega$ back into equation (12), written for a number of points larger than 6, we end up with
a subspace constraint involving only the inverse depths $d^i$ and the derivatives of the image
brightness:

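$$ \frac{\partial I_r(x)}{\partial t} - \mathcal{C}\mathcal{C}^\dagger(x, \nabla_x I_r, d)\, \frac{\partial I_r(x)}{\partial t} = 0 \tag{16} $$

which can be written as

$$ \mathcal{C}^\perp(x, \nabla_x I_r, d)\, \frac{\partial I_r(x)}{\partial t} = 0 \tag{17} $$

where

$$ \mathcal{C}^\perp = \mathrm{Id}_n - \mathcal{C}\mathcal{C}^\dagger \tag{18} $$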


where $\mathrm{Id}_n$ is the identity matrix. Now, since we measure the image brightness at each level
of resolution, modulo some noise that we model as white, zero-mean and Gaussian, we can
view the above equation as a nonlinear, implicit dynamical model with depth parameters $p$ on
the sphere $S^{n-1}$:


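$$ \begin{cases} \mathcal{C}^\perp(x, \nabla_x I_r, d)\, \dfrac{\partial I_r}{\partial t} = 0 & p \in S^{n-1} \\[2ex] y_r(x, t) = I_r(x, t) + n_I(x, t) & n_I(x, t) \in \mathcal{N}(0, \Sigma). \end{cases} \tag{19} $$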


The normalization of the depth parameters $p$ is due to the inherent scale-factor ambiguity [2].
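
The chain from the stacked constraint (12) to the subspace constraint (17), and the scale ambiguity that motivates the normalization, can be illustrated on synthetic data. The following is a minimal sketch, not the authors' implementation: it assumes NumPy, random image locations and brightness gradients (treated as row vectors), and noise-free temporal derivatives generated from (12). All helper names and numerical values are hypothetical.

```python
import numpy as np

def C_point(x, y, d):
    """C(x, d) = [d A | B] of equation (9) at image point (x, y)."""
    A = np.array([[1.0, 0.0, -x], [0.0, 1.0, -y]])
    B = np.array([[-x * y, 1.0 + x**2, -y], [-1.0 - y**2, x * y, x]])
    return np.hstack([d * A, B])

def stacked_C(pts, grads, d):
    """Stack the rows grad_i @ C(x^i, d^i) into the n x 6 matrix of (14)."""
    return np.vstack([g @ C_point(x, y, di)
                      for (x, y), g, di in zip(pts, grads, d)])

def subspace_residual(pts, grads, d, I_t):
    """Left-hand side of (17): (Id_n - C C^+) applied to the temporal derivatives."""
    Cs = stacked_C(pts, grads, d)
    P_perp = np.eye(len(d)) - Cs @ np.linalg.pinv(Cs)
    return P_perp @ I_t

rng = np.random.default_rng(0)
n = 20                                         # more than 6 points, as required
pts = rng.uniform(-0.5, 0.5, size=(n, 2))      # image locations x^i
grads = rng.normal(size=(n, 2))                # synthetic brightness gradients
d_true = rng.uniform(0.2, 1.0, size=n)         # inverse depths
d_wrong = rng.uniform(0.2, 1.0, size=n)        # unrelated inverse depths
motion = np.array([0.1, -0.05, 0.2, 0.02, -0.01, 0.03])   # [V, Omega]

# Noise-free temporal derivatives that satisfy the stacked constraint (12).
I_t = -stacked_C(pts, grads, d_true) @ motion

# Least-squares motion estimate (13), given the true inverse depths.
motion_hat = -np.linalg.pinv(stacked_C(pts, grads, d_true)) @ I_t

r_true = np.linalg.norm(subspace_residual(pts, grads, d_true, I_t))
r_scaled = np.linalg.norm(subspace_residual(pts, grads, 3.0 * d_true, I_t))
r_wrong = np.linalg.norm(subspace_residual(pts, grads, d_wrong, I_t))

print("motion estimate error    :", np.linalg.norm(motion_hat - motion))
print("residual, true depth     :", r_true)
print("residual, 3 x true depth :", r_scaled)
print("residual, wrong depth    :", r_wrong)
```

Scaling all inverse depths by a common factor only rescales the first three columns of the stacked matrix, so its column space and the residual are unchanged. This is the scale ambiguity that the constraint $p \in S^{n-1}$ removes.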

Model (19) is a dynamical model in nonlinear implicit form, and estimating depth amounts
to identifying its parameters $p \in S^{n-1}$. The Essential Filter [2] is a local recursive observer
that accomplishes the task. Therefore we could implement one Essential Filter at each level
of resolution $r$, and interconnect them by propagating the estimates across scales, starting
from the coarsest level.
Note that this filter estimates the depth at each grid point $x^i$ (at all pixels, in the limit
in which gradients are computed on the whole image), relative to the moving camera but
independent of the motion of the camera. This holds for any motion such that $V \neq 0$; the case
$V = 0$ is a non-observable configuration for the depth parameters [2]. The motion components
have been decoupled from the estimation process in deriving the subspace constraint. This
method is inspired by [1], who derive a similar subspace constraint for the direction of
translation. However, they do not view the constraint as a dynamic model, and instead formulate
an optimization task between each pair of views, which they solve by exhaustive search over
all possible directions of translation.
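
Two properties stated above can be illustrated numerically: the subspace constraint holds at the true depth whatever the motion, and depth cannot be recovered when $V = 0$. The sketch below is a hypothetical, noise-free example assuming NumPy and randomly generated gradients and depths. It is not the Essential Filter of [2]; it only evaluates the residual of the constraint for fixed candidate depths.

```python
import numpy as np

def C_point(x, y, d):
    """C(x, d) = [d A | B] at image point (x, y), with inverse depth d."""
    A = np.array([[1.0, 0.0, -x], [0.0, 1.0, -y]])
    B = np.array([[-x * y, 1.0 + x**2, -y], [-1.0 - y**2, x * y, x]])
    return np.hstack([d * A, B])

rng = np.random.default_rng(1)
n = 20
pts = rng.uniform(-0.5, 0.5, size=(n, 2))     # image locations
grads = rng.normal(size=(n, 2))               # synthetic brightness gradients
d_true = rng.uniform(0.2, 1.0, size=n)        # true inverse depths
d_wrong = rng.uniform(0.2, 1.0, size=n)       # unrelated candidate depths

def stacked_C(d):
    """n x 6 matrix whose i-th row is grad_i @ C(x^i, d^i)."""
    return np.vstack([g @ C_point(x, y, di)
                      for (x, y), g, di in zip(pts, grads, d)])

def temporal_derivatives(motion):
    """Noise-free temporal derivatives generated at the true depth."""
    return -stacked_C(d_true) @ motion

def residual_norm(d, I_t):
    """Norm of (Id_n - C C^+) I_t for the candidate inverse depths d."""
    Cs = stacked_C(d)
    return np.linalg.norm(I_t - Cs @ (np.linalg.pinv(Cs) @ I_t))

motion_a = np.array([0.1, -0.05, 0.2, 0.02, -0.01, 0.03])    # V != 0
motion_b = np.array([-0.2, 0.1, 0.05, -0.03, 0.02, 0.01])    # a different motion
rotation_only = np.array([0.0, 0.0, 0.0, 0.02, -0.01, 0.03]) # V = 0

# The true depth satisfies the constraint for both motions.
r_a = residual_norm(d_true, temporal_derivatives(motion_a))
r_b = residual_norm(d_true, temporal_derivatives(motion_b))

# With translation, a wrong depth violates the constraint ...
r_wrong_translating = residual_norm(d_wrong, temporal_derivatives(motion_a))
# ... but under pure rotation every depth satisfies it, so depth is unobservable.
r_wrong_rotating = residual_norm(d_wrong, temporal_derivatives(rotation_only))

print(r_a, r_b, r_wrong_translating, r_wrong_rotating)
```

Under pure rotation the temporal derivatives depend only on the columns of $B$, which do not involve depth, so every candidate depth yields a zero residual.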

This method is a direct extension of the work presented in [3], where depth is estimated
recursively and independent of motion from a sequence of feature-point correspondences.